import bz2
import glob
import json
import lzma
import re

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from nltk.corpus import stopwords
from scipy.spatial.distance import euclidean, cityblock
from sklearn.base import clone
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import calinski_harabasz_score, silhouette_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
from wordcloud import WordCloud, STOPWORDS
%matplotlib inline
Reddit.com has been dubbed "the front page of the internet," meaning it is one of the most popular and frequently visited websites globally. Because of its large user base, submissions on a wide variety of topics and subreddits are made every day, giving the website a very wide array of topics to maintain. One popular subreddit is TIFU ("Today I Fucked Up"), which collects stories from people recounting their mistakes and acts of stupidity. This study aims to uncover the common mistakes people make based on the submissions under the TIFU subreddit. This was achieved by first extracting the topics/concepts of the submissions using Latent Semantic Analysis (LSA), after which k-means clustering was applied to group submissions with similar concepts/topics.
Based on these, 12 distinct topics/themes under the TIFU subreddit were identified, ranging from sexual to work-related to romantic themes.
With the rise of technology, most people in the world can now access the internet. The growth in the number of internet users in turn increases the popularity of different websites. One of the most popular websites today is Reddit.com.
Reddit.com is a massive collection of different kinds of forums on the internet [1]. Users can post on different topics, share news and original content, and comment on subjects across many areas of specialty. Users can also criticize, describe, and reply to each other's posts, which makes Reddit function like a social media platform. As of August 2020, it ranks 6th among the most popular websites in the United States and 15th globally, based on Alexa.com [2]. Its popularity stems from the breadth of content its users post, from simple text posts to very creative memes. Its tagline, "front page of the internet," is a fitting description of the website given its wide array of forum topics.
Being one of the top websites in the United States, Reddit is expected to be viewed by a large number of people on a daily basis. The topics found on Reddit are very scattered because of the number of users, which means the ideas and posts of users may be very similar to or very different from each other. Clustering these posts would make the generalization of topics on Reddit much more stable, and clustering the themes of the different topics would be very beneficial in understanding the likes and dislikes of people in general. This study aims to identify and describe possible cluster labels within one section of Reddit, specifically the Today I Fucked Up (TIFU) subreddit.
A sample screenshot of Reddit can be seen in Figure 1 [3].
Figure 1: Sample Screenshot of Reddit Main Page
Today I Fucked Up (TIFU) is a collection of posts, forums, and content specifically targeting events where a user had a not-so-good experience: moments and actions that were ridiculously stupid. The contents of this section are usually very funny and amusing, since many users share their personal experiences through pictures and memes. One thing to note is that even with so many different experiences shared by different users, some of these posts may be very similar to each other. Because of this, the study will cluster the TIFU submissions in the given dataset and describe the underlying themes of the resulting clusters.
A sample screenshot of the TIFU section can be seen in Figure 2 [4].
Figure 2: Sample Screenshot of TIFU SECTION
To properly address the problem, the researchers need viable data on posts from the TIFU subreddit. While there is a wide array of options for collecting such data, this study uses a public dataset hosted on the Jojie server. The researchers followed the general workflow defined below to arrive at a conclusion and recommendation.
A public dataset on the Jojie server can be accessed at the following path:
/mnt/data/public/reddit/submissions/*.xz
The files above are xz-compressed, and the lzma library is needed to parse them, as shown in the program below.
Since the files contain all Reddit submissions, only the first 1,500,000 lines per file were parsed. Additionally, only entries under the "TIFU" subreddit were stored.
root = "/mnt/data/public/reddit/submissions/"
files = glob.glob(root + "/*.xz")

titles = []
for item in files:
    with lzma.open(item, "r") as f:
        for i, line in enumerate(f):
            if i == 1_500_000:
                break
            entry = json.loads(line.rstrip())
            try:
                if entry['subreddit'].lower() == 'tifu':
                    titles.append(entry['title'])
            except KeyError:
                # skip entries without a subreddit/title field
                pass
df_tifu = pd.DataFrame({'titles': titles})
The data contains the titles of submissions from November 2017 to October 2018.
df_tifu.head()
df_tifu.shape
Below are the steps in the data pre-processing:
The data was cleaned by dropping duplicates and any NaN values. The strings were also lowercased.
tifu_clean = df_tifu.dropna().drop_duplicates()
tifu_clean.titles = tifu_clean.titles.str.lower()
tifu_clean.shape
Since the titles are composed of many different words, the term frequency-inverse document frequency (TF-IDF) vectorizer is used to obtain a vector representation of each title. This vectorizer was chosen because it de-emphasizes both very frequent and very rare words, so the essence of each title is better captured.
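As a minimal, self-contained illustration of this de-emphasis (toy corpus with hypothetical titles, not the actual data), note how a word that appears in every document gets the lowest possible IDF weight:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: "my" appears in every document, "toilet" in only one.
corpus = [
    "my cat knocked over my coffee",
    "my dog ate my homework",
    "my phone fell in the toilet",
]
vec = TfidfVectorizer()
vec.fit(corpus)
vocab = vec.vocabulary_
# The ubiquitous word receives a strictly lower IDF than the rare one,
# so its contribution to each document vector is down-weighted.
print(vec.idf_[vocab["my"]], vec.idf_[vocab["toilet"]])
```

The same mechanism, applied to the full corpus of titles, keeps filler words from dominating the document vectors.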
For the implementation, scikit-learn's TfidfVectorizer was used. For this problem, a token is any run of two or more word characters or apostrophes. Stop words, i.e., very common words in the English language, were also removed during vectorization. A max_df of 0.995 and a min_df of 0.005 were used to keep only words that appear in between 0.5% and 99.5% of the documents in the corpus.
stopwords = ['tifu', 'today', 'i', 'fucked', 'up', 'and', 'my', 'to', 'wa']
vectorizer = TfidfVectorizer(token_pattern=r"[\w\']{2,}",
                             stop_words=list(STOPWORDS) + stopwords,
                             min_df=0.005,
                             max_df=0.995)
bow = vectorizer.fit_transform(tifu_clean.titles)
bow = pd.DataFrame.sparse.from_spmatrix(bow,
                                        columns=vectorizer.get_feature_names())
bow
To extract the underlying concepts from the documents (i.e., the Reddit submissions), Latent Semantic Analysis (LSA) was performed. LSA also lets us discard low-variance dimensions and represent the documents in terms of concepts, which can improve the quality of the data representation.
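As a small, self-contained sketch of the idea (toy documents with two obvious topics, not the actual data), LSA compresses a word-level representation into a handful of concept dimensions:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Two toy "topics": pets and baking.
docs = [
    "cat dog pet vet",
    "dog cat pet leash",
    "bake oven flour bread",
    "oven bread flour recipe",
]
tfidf = TfidfVectorizer().fit_transform(docs)
# Compress the vocabulary into 2 concept dimensions.
svd = TruncatedSVD(n_components=2, random_state=0)
concepts = svd.fit_transform(tfidf)
# Each document is now a 2-dimensional concept vector
# instead of a sparse vector over the whole vocabulary.
print(concepts.shape)
```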
The implementation is shown below.
An explained variance threshold of 90% was used when reducing dimensions through Truncated SVD. This reduced the number of features from 141 to 111.
svd = TruncatedSVD(n_components=bow.shape[1] - 1, random_state=0)
svd.fit(bow)
plt.plot(range(1, bow.shape[1]), svd.explained_variance_ratio_)
plt.plot(range(1, bow.shape[1]), svd.explained_variance_ratio_.cumsum())
plt.axhline(0.9, ls='--')
plt.title('Figure 4: Explained Variance Ratio')
n = np.argwhere(svd.explained_variance_ratio_.cumsum() > 0.9)[0][0]
print('Explained Variance >= 0.9 at n = {}'.format(n))
svd = TruncatedSVD(n_components=n, random_state=0)
X = svd.fit_transform(bow)
Examination of the dimensions reveals that the keywords are usually verbs, along with a few other keywords such as 'accidentally', 'nsfw', and 'friend'.
for i in range(5):
    order = np.argsort(np.abs(svd.components_[i]))[-10:]
    plt.barh(np.array(vectorizer.get_feature_names())[order],
             svd.components_[i][order])
    plt.title(f'SVD {i+1}')
    plt.show()
With the documents now represented in terms of their concepts/topics via LSA, K-means clustering was done to cluster the submissions based on the topic.
The clustering was done with KMeans from the sklearn package. The clustering was seeded to make the work reproducible. The number of clusters k was scanned from 4 to 13. The Sum of Squared Errors (SSE), Calinski-Harabasz (CH) score, and Silhouette score were used to determine the best value of k.
By examining the plots of these clustering scores below, k = 12 was chosen because of the slight elbow in SSE and a major spike in the Silhouette score. A higher number of clusters was chosen to be more descriptive of the themes of the submissions in the TIFU subreddit.
SSE = []
CH = []
Silhouette = []
start = 4
end = 13
for k in range(start, end + 1):
    model = KMeans(n_clusters=k, random_state=1)
    y = model.fit_predict(X)
    SSE.append(model.inertia_)
    CH.append(calinski_harabasz_score(X, y))
    Silhouette.append(silhouette_score(X, y))
scores = list(zip([SSE, CH, Silhouette], ['SSE', 'CH', 'Silhouette']))
fig, ax = plt.subplots(3, figsize=(6, 11))
for i in range(len(scores)):
    ax[i].plot(range(start, end + 1), scores[i][0], marker='o')
    ax[i].set_title(scores[i][1])
plt.show()
model = KMeans(n_clusters = 12,random_state=1)
y = model.fit_predict(X)
tifu_clean['cluster'] = y
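Before visualizing the clusters, a quick look at cluster sizes helps catch degenerate clusters (e.g., one cluster absorbing nearly everything). A minimal sketch on synthetic data, where make_blobs stands in for the actual LSA document vectors:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the LSA feature matrix X.
X_demo, _ = make_blobs(n_samples=300, centers=4, random_state=1)
labels = KMeans(n_clusters=4, random_state=1, n_init=10).fit_predict(X_demo)
# Wildly unbalanced counts would suggest a poor choice of k.
sizes = pd.Series(labels).value_counts().sort_index()
print(sizes)
```

On the real data, the same `value_counts()` call on `tifu_clean['cluster']` gives the submission count per theme.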
The EDA is conducted primarily through wordclouds, as visual examination of the dominant words is a good way to extract the dominant themes in each cluster.
By examining each cluster's wordcloud, the following dominant themes can be established among the TIFU submissions.
Another insight from these cluster themes is that TIFU submissions constantly mention 'girl' or 'girlfriend' but never the male counterpart. This suggests that TIFU submissions are written and read by a predominantly male audience, which is consistent with redditors being predominantly male.
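A claim like this can also be checked directly with word-boundary counts over the titles; a sketch on toy data (hypothetical titles, not the real corpus):

```python
import pandas as pd

# Hypothetical titles standing in for tifu_clean.titles.
titles = pd.Series([
    "tifu by texting my girlfriend's mom",
    "tifu by calling a girl by the wrong name",
    "tifu by forgetting my boyfriend's birthday",
])
# \b word boundaries avoid matching 'girl' inside unrelated words;
# the optional group also catches 'girlfriend'/'boyfriend'.
girl = titles.str.count(r"\bgirl(friend)?\b").sum()
boy = titles.str.count(r"\bboy(friend)?\b").sum()
print(girl, boy)  # counts of female- vs male-referencing mentions
```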
def word_cloud(data, cols):
    """Plot one word cloud per cluster, arranged in a grid of `cols` columns."""
    tot = len(data['cluster'].unique())
    rows = tot // cols
    rows += tot % cols
    pos = range(1, tot + 1)
    fig = plt.figure(dpi=200, figsize=(9*cols, 7*rows))
    for k in range(tot):
        cluster = ' '.join(data.query(f'cluster=={k}')['titles'])
        wordcloud = WordCloud(stopwords=list(STOPWORDS) + stopwords,
                              background_color='white',
                              width=800, height=600,
                              random_state=1).generate(cluster)
        ax = fig.add_subplot(rows, cols, pos[k])
        ax.imshow(wordcloud)
        ax.axis('off')
        ax.set_title(f"Common Themes for Cluster {k}")

word_cloud(tifu_clean, 2)